Data Analysis of Burglaries in Chicago (2015-2019)

By Annivas Exarchos

chicago

source: https://hotelemc2.com/why-chicago-is-the-best-city-in-the-world/

Introduction

The overall objective of this project will be to analyze burglary data for Chicago, IL from 2015 to 2019.

Throughout this tutorial, we will attempt to find when and where burglaries are most likely to take place, while also complementing our analysis with interesting burglary trends and statistics.

Required Tools

This project is written in Python 3 (the pip output below shows a Python 3.8 environment).

You will need the following Python libraries: pandas, NumPy, Matplotlib, folium, and sodapy.

Installations & Imports

In [2]:
!pip install folium # install to create maps
Requirement already satisfied: folium in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (0.11.0)
Requirement already satisfied: requests in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (from folium) (2.24.0)
Requirement already satisfied: branca>=0.3.0 in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (from folium) (0.4.1)
Requirement already satisfied: jinja2>=2.9 in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (from folium) (2.11.2)
Requirement already satisfied: numpy in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (from folium) (1.18.5)
Requirement already satisfied: chardet<4,>=3.0.2 in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (from requests->folium) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (from requests->folium) (2020.6.20)
Requirement already satisfied: idna<3,>=2.5 in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (from requests->folium) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (from requests->folium) (1.25.9)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/annivas/opt/anaconda3/lib/python3.8/site-packages (from jinja2>=2.9->folium) (1.1.1)
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import folium
from sodapy import Socrata
from folium.plugins import HeatMap

1. Data Collection

This is the first stage of the data lifecycle. Here, we will collect all the data needed for our project.

The main dataset that we will be using contains all reported crimes in the city of Chicago since 2001 and can be found in the official Chicago Data Portal.

The data is stored in a large csv file, which we will be accessing using the sodapy client through the Socrata Open Data API.

From this file we will only extract crime data for the years 2015-2019.

In [4]:
# These can be found in the data portal
domain = 'data.cityofchicago.org'
dataset_id = 'ijzp-q8t2'

# Generate token by creating an account for the data portal
token = 'Lkysyak9elTtcNXRVmfsj9YLX'

client = Socrata(domain, token)

# Get data for 2015-2019
results = client.get(dataset_id, where="date >= '2015-01-01' and date < '2020-01-01'", limit=2000000)

# Store into pandas dataframe
crime_table = pd.DataFrame.from_dict(results)

# Display dataframe
crime_table
Out[4]:
id case_number date block iucr primary_type description location_description arrest domestic ... ward community_area fbi_code x_coordinate y_coordinate year updated_on latitude longitude location
0 10225520 HY412735 2015-01-01T00:00:00.000 075XX S BLACKSTONE AVE 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300 RESIDENCE False False ... 5 43 11 1187511 1855334 2015 2018-02-10T15:50:01.000 41.758131167 -87.588352326 {'latitude': '41.758131167', 'longitude': '-87...
1 11028448 JA360336 2015-01-01T00:00:00.000 051XX W HURON ST 0281 CRIM SEXUAL ASSAULT NON-AGGRAVATED APARTMENT True True ... 37 25 02 NaN NaN 2015 2019-09-02T15:57:18.000 NaN NaN NaN
2 10225760 HY412902 2015-01-01T00:00:00.000 050XX N MARINE DR 0810 THEFT OVER $500 APARTMENT False False ... 48 3 06 1169650 1934124 2015 2018-02-10T15:50:01.000 41.974742888 -87.651517395 {'latitude': '41.974742888', 'longitude': '-87...
3 11242929 JB168310 2015-01-01T00:00:00.000 049XX S COTTAGE GROVE AVE 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300 APARTMENT False False ... 4 39 11 NaN NaN 2015 2018-03-01T15:54:55.000 NaN NaN NaN
4 10229179 HY416572 2015-01-01T00:00:00.000 039XX S LAKE PARK AVE 1752 OFFENSE INVOLVING CHILDREN AGG CRIM SEX ABUSE FAM MEMBER RESIDENCE False False ... 4 36 20 1183388 1878984 2015 2018-02-10T15:50:01.000 41.823125769 -87.602725951 {'latitude': '41.823125769', 'longitude': '-87...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1331526 11937967 JC567053 2019-12-31T23:46:00.000 034XX W JACKSON BLVD 143A WEAPONS VIOLATION UNLAWFUL POSS OF HANDGUN STREET False False ... 28 27 15 1153587 1898480 2019 2020-01-07T15:52:21.000 41.877268465 -87.711536692 {'latitude': '41.877268465', 'longitude': '-87...
1331527 11938240 JD100002 2019-12-31T23:48:00.000 004XX S CICERO AVE 143A WEAPONS VIOLATION UNLAWFUL POSS OF HANDGUN VEHICLE NON-COMMERCIAL True False ... 29 25 15 1144466 1897452 2019 2020-01-07T15:52:21.000 41.874623951 -87.745052647 {'latitude': '41.874623951', 'longitude': '-87...
1331528 11938857 JD100599 2019-12-31T23:50:00.000 004XX N Ashland ave 0820 THEFT $500 AND UNDER BAR OR TAVERN False False ... 27 24 06 NaN NaN 2019 2020-01-07T15:52:21.000 NaN NaN NaN
1331529 11940078 JD100016 2019-12-31T23:54:00.000 063XX S MAY ST 0420 BATTERY AGGRAVATED:KNIFE/CUTTING INSTR SIDEWALK False False ... 16 68 04B 1169736 1862855 2019 2020-01-08T15:47:27.000 41.779173667 -87.653277703 {'latitude': '41.779173667', 'longitude': '-87...
1331530 11938228 JD100017 2019-12-31T23:55:00.000 0000X W 69TH ST 143A WEAPONS VIOLATION UNLAWFUL POSS OF HANDGUN STREET True False ... 6 69 15 1176896 1859260 2019 2020-01-07T15:52:21.000 41.769150218 -87.627136786 {'latitude': '41.769150218', 'longitude': '-87...

1331531 rows × 22 columns
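
The request above embeds the app token directly in the notebook. As a minimal sketch of an alternative, the token could be read from an environment variable and the date filter built by a small helper. The variable name CHICAGO_APP_TOKEN and the functions build_date_filter and fetch_crimes are hypothetical names introduced here for illustration; the sodapy calls are the same two used in the cell above.

```python
import os

DOMAIN = 'data.cityofchicago.org'
DATASET_ID = 'ijzp-q8t2'

def build_date_filter(first_year, last_year):
    """Return a SoQL where clause covering first_year through last_year inclusive."""
    return f"date >= '{first_year}-01-01' and date < '{last_year + 1}-01-01'"

def fetch_crimes(first_year, last_year, limit=2000000):
    """Download matching rows; needs network access and the sodapy package."""
    from sodapy import Socrata  # imported lazily so the helper above works without it
    # CHICAGO_APP_TOKEN is a hypothetical environment variable name
    token = os.environ.get('CHICAGO_APP_TOKEN')
    client = Socrata(DOMAIN, token)
    return client.get(DATASET_ID, where=build_date_filter(first_year, last_year), limit=limit)

# Only the filter string is built here; no request is sent
print(build_date_filter(2015, 2019))
```

Calling fetch_crimes(2015, 2019) would then return the same list of records that the cell above passes to pandas, assuming the environment variable is set.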

Additionally, we will be using some data downloaded from the FBI's Crime Data Explorer in CSV format. These burglary-specific datasets include statistics about victims' and offenders' age, sex, and race, as well as the relationship between victims and offenders and other crimes that burglary offenders have been charged with.

To import this data into dataframes, we will be using pandas' read_csv method.

Burglary offenders by age:

In [14]:
offender_age = pd.read_csv("https://annivas.github.io/chicagoburglaries/files/offender-age-2015-2019.csv")
offender_age
Out[14]:
Key Value
0 0-9 4041
1 10-19 182462
2 20-29 306174
3 30-39 231868
4 40-49 104887
5 50-59 69272
6 60-69 10986
7 70-79 1744
8 80-89 496
9 90-Older 2269
10 Unknown 546607

Burglary offenders by sex:

In [7]:
offender_sex = pd.read_csv("https://annivas.github.io/chicagoburglaries/files/offender-sex-2015-2019.csv")
offender_sex
Out[7]:
Key Value Percent
0 Male 799186 0.548295
1 Female 185725 0.127420
2 Unknown 472674 0.324286

Burglary offenders by race:

In [8]:
offender_race = pd.read_csv("https://annivas.github.io/chicagoburglaries/files/offender-race-2015-2019.csv")
offender_race
Out[8]:
Key Value
0 Asian 6421
1 Native Hawaiian 0
2 Black or African American 293696
3 American Indian or Alaska Native 10945
4 White 618786
5 Unknown 526479

Burglary victims by age:

In [9]:
victim_age = pd.read_csv("https://annivas.github.io/chicagoburglaries/files/victim-age-2015-2019.csv")
victim_age
Out[9]:
Key Value
0 0-9 7600
1 10-19 70293
2 20-29 427706
3 30-39 442191
4 40-49 375548
5 50-59 370236
6 60-69 266728
7 70-79 130137
8 80-89 47165
9 90-Older 9419
10 Unknown 29590

Burglary victims by sex:

In [10]:
victim_sex = pd.read_csv("https://annivas.github.io/chicagoburglaries/files/victim-sex-2015-2019.csv")
victim_sex
Out[10]:
Key Value Percent
0 Male 1164541 0.535024
1 Female 996356 0.457755
2 Unknown 15716 0.007220

Burglary victims by race:

In [15]:
victim_race = pd.read_csv("https://annivas.github.io/chicagoburglaries/files/victim-race-2015-2019.csv")
victim_race
Out[15]:
Key Value
0 Asian 41819
1 Native Hawaiian 0
2 Black or African American 413438
3 American Indian or Alaska Native 10844
4 White 1607281
5 Unknown 99278

Relationship between burglary offenders and victims:

In [12]:
victim_offender_relationship = pd.read_csv("https://annivas.github.io/chicagoburglaries/files/victim-offender-relationship-2015-2019.csv")
victim_offender_relationship
Out[12]:
Key Value
0 Acquaintance 19045
1 Babysittee 45
2 Boyfriend/Girlfriend 13371
3 Child of Boyfriend/Girlfriend 190
4 Child 454
5 Employee 134
6 Employer 216
7 Friend 2814
8 Grandchild 14
9 Grandparent 388
10 Homosexual Relationship 328
11 In-Law 500
12 Neighbor 2840
13 Other Family Member 2312
14 Otherwise Known 14171
15 Parent 2025
16 Relationship Unknown 51544
17 Sibling 1020
18 Stepchild 101
19 Spouse 1856
20 Stepparent 263
21 Stepsibling 49
22 Stranger 26995
23 Offender 478
24 Ex Spouse 2706
25 Common Law Spouse 249

Other offenses linked to burglary offenders:

In [13]:
linked_offenses = pd.read_csv("https://annivas.github.io/chicagoburglaries/files/linked-offenses-2015-2019.csv")
linked_offenses
Out[13]:
Key Value
0 Identity Theft 1147
1 Fondling 982
2 Bribery 982
3 Stolen Property Offenses 12711
4 Pocket-Picking 340
5 Murder and Nonnegligent Manslaughter 344
6 Welfare Fraud 21
7 Extortion/Blackmail 109
8 Theft from Coin-operated Machine or Device 889
9 Statutory Rape 31
10 Intimidation 8189
11 Gambling Equipment Violation 2
12 Shoplifting 2697
13 Robbery 6808
14 Incest 0
15 Animal Cruelty 145
16 Destruction/Damage/Vandalism of Property 295271
17 Operating/Promoting/Assisting Gambling 6
18 Hacking/Computer Invasion 69
19 Sports Tampering 0
20 Kidnapping/Abduction 4240
21 Motor Vehicle Theft 29080
22 embezzlement 346
23 Impersonation 1743
24 Weapon Law Violations 7462
25 Negligent Manslaughter 3
26 Wire Fraud 71
27 Theft From Building 43694
28 Theft of Motor Vehicle Parts or Accessories 1915
29 False Pretenses/Swindle/Confidence game 4113
30 Drug Equipment Violations 8457
31 Sexual Assault with an Object 158
32 Counterfeiting/Forgery 2785
33 Purse-snatching 267
34 Prostitution 24
35 Credit Card/Automated Teller Machine Fraud 5315
36 Pornography/Obscene Material 79
37 Human Trafficking, Commercial Sex Acts 2
38 Betting/Wagering 0
39 Human Trafficking, Involuntary Servitude 0
40 Purchasing Prostitution 1
41 Theft from Motor Vehicle 20263
42 Drug/Narcotic Violations 15014
43 Aggravated Assault 17962
44 Simple Assault 31799
45 Assisting or Promoting Prostitution 12
46 Burglary/Breaking & Entering 0
47 Arson 2584
48 All Other Larceny 81300
49 Sodomy 253

2. Data Processing

Now that we have collected all the necessary data, it's time to process it and organize it in a way that will serve our needs for the remainder of the project.

First, let's extract all burglaries from the crime table into a new table. We will only choose the columns we need, as the initial crime table is filled with unnecessary information.

In [16]:
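# Get burglaries from crime table (only the columns we need).
# .copy() makes the result an independent table, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
burglary_table = crime_table.loc[crime_table['primary_type'] == 'BURGLARY', ['id', 'primary_type', 'description', 'arrest', 'location', 'latitude', 'longitude', 'date', 'year']].copy()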
# Display first 5 rows
burglary_table.head()
Out[16]:
id primary_type description arrest location latitude longitude date year
295 9913600 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.975141744', 'longitude': '-87... 41.975141744 -87.76454628 2015-01-01T00:01:00.000 2015
392 9911420 BURGLARY FORCIBLE ENTRY False {'latitude': '41.882806623', 'longitude': '-87... 41.882806623 -87.705030858 2015-01-01T01:00:00.000 2015
450 9911551 BURGLARY FORCIBLE ENTRY False {'latitude': '41.901003465', 'longitude': '-87... 41.901003465 -87.649282987 2015-01-01T02:00:00.000 2015
455 9914392 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.873236088', 'longitude': '-87... 41.873236088 -87.740923693 2015-01-01T02:00:00.000 2015
585 9911384 BURGLARY FORCIBLE ENTRY False {'latitude': '41.996215265', 'longitude': '-87... 41.996215265 -87.716862144 2015-01-01T04:50:00.000 2015

Now that we have just the data we need, let's add some new columns derived from the "date" column.

Create month column:

In [18]:
# Months are represented as ints from 1 (January) to 12 (December).
# We could represent months as strings, but integers facilitate plotting.
burglary_table['month'] = pd.DatetimeIndex(burglary_table['date']).month
# Display first 5 rows
burglary_table.head()
Out[18]:
id primary_type description arrest location latitude longitude date year month
295 9913600 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.975141744', 'longitude': '-87... 41.975141744 -87.76454628 2015-01-01T00:01:00.000 2015 1
392 9911420 BURGLARY FORCIBLE ENTRY False {'latitude': '41.882806623', 'longitude': '-87... 41.882806623 -87.705030858 2015-01-01T01:00:00.000 2015 1
450 9911551 BURGLARY FORCIBLE ENTRY False {'latitude': '41.901003465', 'longitude': '-87... 41.901003465 -87.649282987 2015-01-01T02:00:00.000 2015 1
455 9914392 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.873236088', 'longitude': '-87... 41.873236088 -87.740923693 2015-01-01T02:00:00.000 2015 1
585 9911384 BURGLARY FORCIBLE ENTRY False {'latitude': '41.996215265', 'longitude': '-87... 41.996215265 -87.716862144 2015-01-01T04:50:00.000 2015 1

Create day column:

In [19]:
# Days are represented as ints from 0 (Monday) to 6 (Sunday).
# We could represent days as strings, but integers facilitate plotting.
burglary_table['day'] = pd.DatetimeIndex(burglary_table['date']).weekday
# Display first 5 rows
burglary_table.head()
Out[19]:
id primary_type description arrest location latitude longitude date year month day
295 9913600 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.975141744', 'longitude': '-87... 41.975141744 -87.76454628 2015-01-01T00:01:00.000 2015 1 3
392 9911420 BURGLARY FORCIBLE ENTRY False {'latitude': '41.882806623', 'longitude': '-87... 41.882806623 -87.705030858 2015-01-01T01:00:00.000 2015 1 3
450 9911551 BURGLARY FORCIBLE ENTRY False {'latitude': '41.901003465', 'longitude': '-87... 41.901003465 -87.649282987 2015-01-01T02:00:00.000 2015 1 3
455 9914392 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.873236088', 'longitude': '-87... 41.873236088 -87.740923693 2015-01-01T02:00:00.000 2015 1 3
585 9911384 BURGLARY FORCIBLE ENTRY False {'latitude': '41.996215265', 'longitude': '-87... 41.996215265 -87.716862144 2015-01-01T04:50:00.000 2015 1 3

Create time column:

In [20]:
# Time is expressed in hours and hours are represented as ints from 0 (12 am) to 23 (11 pm)
burglary_table['time'] = pd.DatetimeIndex(burglary_table['date']).hour
# Display first 5 rows
burglary_table.head()
Out[20]:
id primary_type description arrest location latitude longitude date year month day time
295 9913600 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.975141744', 'longitude': '-87... 41.975141744 -87.76454628 2015-01-01T00:01:00.000 2015 1 3 0
392 9911420 BURGLARY FORCIBLE ENTRY False {'latitude': '41.882806623', 'longitude': '-87... 41.882806623 -87.705030858 2015-01-01T01:00:00.000 2015 1 3 1
450 9911551 BURGLARY FORCIBLE ENTRY False {'latitude': '41.901003465', 'longitude': '-87... 41.901003465 -87.649282987 2015-01-01T02:00:00.000 2015 1 3 2
455 9914392 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.873236088', 'longitude': '-87... 41.873236088 -87.740923693 2015-01-01T02:00:00.000 2015 1 3 2
585 9911384 BURGLARY FORCIBLE ENTRY False {'latitude': '41.996215265', 'longitude': '-87... 41.996215265 -87.716862144 2015-01-01T04:50:00.000 2015 1 3 4
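
The three cells above each build a new DatetimeIndex from the same strings. A possible alternative, sketched here on three sample timestamps in the same format, is to parse the column once with pd.to_datetime and derive all three fields through the .dt accessor. The small demo table is an assumption made for illustration; the column names match the ones used above.

```python
import pandas as pd

# Three timestamps in the same format as the "date" column
demo = pd.DataFrame({'date': ['2015-01-01T00:01:00.000',
                              '2016-08-21T17:50:00.000',
                              '2019-12-31T23:55:00.000']})

# Parse once, then reuse the parsed column for every derived field
parsed = pd.to_datetime(demo['date'])
demo['month'] = parsed.dt.month    # 1 (January) to 12 (December)
demo['day'] = parsed.dt.weekday    # 0 (Monday) to 6 (Sunday)
demo['time'] = parsed.dt.hour      # 0 (12 am) to 23 (11 pm)

print(demo)
```

The derived values should match what the DatetimeIndex approach produces; parsing once simply avoids repeating the conversion for each new column.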

For the complementary data we imported, the only processing that needs to be done is setting the "Key" column as the index of each table and sorting the tables by "Value" to facilitate plotting.

In [21]:
offender_age = offender_age.set_index('Key').sort_values(by="Value", ascending=False)
offender_sex = offender_sex.set_index('Key').sort_values(by="Value", ascending=False)
offender_race = offender_race.set_index('Key').sort_values(by="Value", ascending=False)
victim_age = victim_age.set_index('Key').sort_values(by="Value", ascending=False)
victim_sex = victim_sex.set_index('Key').sort_values(by="Value", ascending=False)
victim_race = victim_race.set_index('Key').sort_values(by="Value", ascending=False)
victim_offender_relationship = victim_offender_relationship.set_index('Key').sort_values(by="Value", ascending=False)
linked_offenses = linked_offenses.set_index('Key').sort_values(by="Value", ascending=False)
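
Since the same two steps are applied to all eight tables, they could also be wrapped in a small helper. The sketch below uses a hypothetical function name, prepare, and a three-row stand-in table with the same Key/Value layout as the FBI files.

```python
import pandas as pd

def prepare(table):
    """Index a Key/Value table by 'Key' and sort it by 'Value', largest first."""
    return table.set_index('Key').sort_values(by='Value', ascending=False)

# Small stand-in table using the offender-by-sex counts shown earlier
demo = pd.DataFrame({'Key': ['Male', 'Female', 'Unknown'],
                     'Value': [799186, 185725, 472674]})
prepared = prepare(demo)

print(prepared)
```

Each of the eight assignments above would then reduce to a call such as offender_age = prepare(offender_age).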

3. Exploratory Data Analysis & Visualization

Now that our data is clean and organized, it's time to analyze it through the use of visualizations. This is usually the most interesting part of the data lifecycle, as we will attempt to plot our data and observe possible trends.

First, we will use the original crime table to count the occurrences of each type of crime from 2015 to 2019.

In [22]:
# Calculate the number of occurrences of each crime type in crime_table
crime_type_occ = crime_table['primary_type'].value_counts()
crime_type_occ
Out[22]:
THEFT                                311044
BATTERY                              247755
CRIMINAL DAMAGE                      143240
ASSAULT                               96114
DECEPTIVE PRACTICE                    93186
OTHER OFFENSE                         86123
NARCOTICS                             77581
BURGLARY                              61851
MOTOR VEHICLE THEFT                   51693
ROBBERY                               51149
CRIMINAL TRESPASS                     33247
WEAPONS VIOLATION                     23295
OFFENSE INVOLVING CHILDREN            11667
PUBLIC PEACE VIOLATION                 8419
CRIM SEXUAL ASSAULT                    6858
INTERFERENCE WITH PUBLIC OFFICER       6183
SEX OFFENSE                            5542
PROSTITUTION                           4255
HOMICIDE                               3068
ARSON                                  2161
LIQUOR LAW VIOLATION                   1210
CRIMINAL SEXUAL ASSAULT                1080
GAMBLING                               1033
STALKING                                953
KIDNAPPING                              926
INTIMIDATION                            741
CONCEALED CARRY LICENSE VIOLATION       505
OBSCENITY                               331
NON-CRIMINAL                            142
HUMAN TRAFFICKING                        60
PUBLIC INDECENCY                         59
OTHER NARCOTIC VIOLATION                 29
NON - CRIMINAL                           25
NON-CRIMINAL (SUBJECT SPECIFIED)          6
Name: primary_type, dtype: int64

From the above data, theft looks to be the most common crime in Chicago, while burglary is 8th.

Now, let's plot the 13 most common types of crime in a pie chart to get a better idea.

In [37]:
crime_type_occ[0:13].plot(kind='pie', figsize=(10, 10), title="Types of Crime in Chicago (2015-2019)", autopct='%1.1f%%')
plt.ylabel("")
plt.show()

By using the burglary_table, we can plot the number of burglaries by year and hopefully observe a trend.

In [38]:
burglary_table['year'].value_counts().sort_index().plot(kind='bar', rot=0, title="Chicago Burglaries by Year", figsize=(10, 8))
plt.ylabel("Number of Burglaries")
plt.xlabel("Year")
plt.show()

From the above bar plot, we can tell that 2016 had the most burglaries of the five years. More importantly, the number of burglaries has decreased every year since 2016.

Now let's try to visualize burglaries by month. By counting the occurrences of each month in our burglary table and dividing by 5, we get the average number of burglaries in each calendar month over the five years.

In [39]:
# Total number of burglaries for 5 years, grouped by month
burglaries_by_month = burglary_table['month'].value_counts().sort_index()
# Divide value for each month by 5 to get normalized number of burglaries per month
burglaries_by_month = burglaries_by_month.apply(lambda x: x/5)
burglaries_by_month
Out[39]:
1     1053.8
2      777.2
3      865.0
4      895.4
5     1006.2
6     1030.6
7     1156.0
8     1178.2
9     1103.8
10    1136.0
11    1076.4
12    1091.6
Name: month, dtype: float64
In [54]:
figure(num=None, figsize=(14, 8))
x = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
y = burglaries_by_month
plt.bar(x,y)
plt.title("Chicago Burglaries by Month")
plt.ylabel("Average Number of Burglaries")
plt.xlabel("Month")
plt.show()

After plotting the average number of burglaries per month, we can start observing some trends. Most burglaries occur in August (about 1,178 per year), followed by July (about 1,156). This could be because many homes are left unoccupied during summer vacations, or because warmer weather creates more favorable conditions for burglary (e.g., open windows). February has the lowest average (about two thirds of August's figure), suggesting that households are safest during that month. This could be because of the cold, snowy weather, but February is also the shortest month of the year, so it would have fewer burglaries than other months even if the daily rate were the same.
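
To separate the effect of month length from a real seasonal difference, the monthly averages can be divided by the number of days in each month. The sketch below copies the averages from the output above and uses the standard library's calendar module; the helper name average_days is introduced here for illustration.

```python
import calendar

# Average monthly burglaries copied from the output above
monthly_avg = {1: 1053.8, 2: 777.2, 3: 865.0, 4: 895.4, 5: 1006.2, 6: 1030.6,
               7: 1156.0, 8: 1178.2, 9: 1103.8, 10: 1136.0, 11: 1076.4, 12: 1091.6}
years = range(2015, 2020)

def average_days(month):
    """Mean length of a month over 2015-2019 (accounts for the 2016 leap day)."""
    return sum(calendar.monthrange(y, month)[1] for y in years) / len(years)

# Average burglaries per day within each calendar month
per_day = {m: total / average_days(m) for m, total in monthly_avg.items()}

for m, rate in per_day.items():
    print(f"{calendar.month_name[m]:<10}{rate:6.2f}")
```

On a per-day basis February is still the lowest month, but only narrowly below March, so part of the gap seen in the bar plot does come from February's length. August remains the highest.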

We can also plot the number of burglaries by day of the week.

In [41]:
# Total number of burglaries for 5 years, grouped by day of week
burglaries_by_day = burglary_table['day'].value_counts().sort_index()
# Divide value for each day by 5 to get number of burglaries for each year by day
burglaries_by_day = burglaries_by_day.apply(lambda x: x/5)
# Divide value for each day by 52.1429 (number of weeks in a year) to get normalized number of burglaries by day of week
burglaries_by_day = burglaries_by_day.apply(lambda x: x/52.1429)
burglaries_by_day
Out[41]:
0    36.165998
1    35.283807
2    35.475587
3    35.126546
4    37.458599
5    30.504632
6    27.221347
Name: day, dtype: float64
In [55]:
figure(num=None, figsize=(14, 8))
x = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
y = burglaries_by_day
plt.bar(x,y)
plt.title("Chicago Burglaries by Day of the Week")
plt.ylabel("Average Number of Burglaries")
plt.xlabel("Day")
plt.show()

There seem to be about 36 burglaries per weekday in Chicago, while the number is lower on weekends. The average number of burglaries on Saturdays is 30.5 and on Sundays 27.2. This could be attributed to the fact that most people are at work on weekdays, and empty houses make better targets for burglars. Weekends (especially Sundays) seem to be less suitable days for burglaries, as most families tend to stay at home.
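
The normalization above divides by 52.1429 weeks per year, which is an approximation. As a sketch of a slightly more exact approach, the number of times each weekday actually occurs between 2015-01-01 and 2019-12-31 can be counted with the standard library. The five-year totals below are recovered from the averages in the output above rather than read from the dataset.

```python
from datetime import date, timedelta

start, end = date(2015, 1, 1), date(2019, 12, 31)

# Count how many times each weekday (0 = Monday ... 6 = Sunday) occurs
weekday_counts = [0] * 7
current = start
while current <= end:
    weekday_counts[current.weekday()] += 1
    current += timedelta(days=1)

# Averages copied from the output above; undo the two divisions to get totals
displayed_avg = [36.165998, 35.283807, 35.475587, 35.126546,
                 37.458599, 30.504632, 27.221347]
totals = [round(a * 5 * 52.1429) for a in displayed_avg]

# Average burglaries per occurrence of each weekday
exact_avg = [t / n for t, n in zip(totals, weekday_counts)]

print(weekday_counts)
print([round(v, 2) for v in exact_avg])
```

The difference from the approximate figures is small, so the conclusion about weekdays versus weekends does not change.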

Now let's dive a step deeper, and plot the number of burglaries by time of the day.

In [48]:
# Total number of burglaries for 5 years, grouped by hour in the day
burglaries_by_time = burglary_table['time'].value_counts().sort_index()
# Divide value for each hour by 5 to get number of burglaries for each year per hour
burglaries_by_time = burglaries_by_time.apply(lambda x: x/5)
# Divide value for each hour by 8760 (number of hours in a year).
# Note: each hour of the day occurs 365 times a year, so multiply these values by 24 to get average burglaries per day in that hour
burglaries_by_time = burglaries_by_time.apply(lambda x: x/8760)
burglaries_by_time
Out[48]:
0     0.056256
1     0.035913
2     0.035228
3     0.037329
4     0.035571
5     0.038836
6     0.045297
7     0.068311
8     0.083174
9     0.083082
10    0.073539
11    0.065251
12    0.084384
13    0.062945
14    0.069749
15    0.071963
16    0.064886
17    0.071210
18    0.066872
19    0.059132
20    0.053447
21    0.052100
22    0.054749
23    0.042900
Name: time, dtype: float64
In [56]:
burglaries_by_time.plot(kind='bar', rot=0, title="Chicago Burglaries by Time of the Day", figsize=(14, 8))
plt.ylabel("Average Number of Burglaries")
plt.xlabel("Time (in hours)")
plt.show()

One might expect most burglaries to occur at nighttime. However, according to the above bar plot, burglaries in Chicago peak around 8am, 9am, and 12pm. Because the plotted values are divided by the 8,760 hours in a year, multiplying them by 24 gives the average per day: roughly two burglaries in each of these hours every day. Burglaries are least likely to occur from 1am to 6am. A possible explanation is the same as for the weekday pattern: the number of burglaries rises at the times when most people leave home for work, which suggests that burglars prefer empty homes.
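
The hourly values in the output above were divided by 8,760, the number of hours in a year. Each hour of the day occurs only once per day, so a per-day figure is 24 times larger. The sketch below rescales the displayed values; the list is copied from the output above and the variable names are introduced here for illustration.

```python
# Hourly values copied from the output above (position = hour of day, 0 to 23)
per_hour_of_year = [
    0.056256, 0.035913, 0.035228, 0.037329, 0.035571, 0.038836,
    0.045297, 0.068311, 0.083174, 0.083082, 0.073539, 0.065251,
    0.084384, 0.062945, 0.069749, 0.071963, 0.064886, 0.071210,
    0.066872, 0.059132, 0.053447, 0.052100, 0.054749, 0.042900,
]

HOURS_PER_DAY = 24

# Scaling by 8760 / 365 = 24 gives average burglaries per day in each hour
per_day = [value * HOURS_PER_DAY for value in per_hour_of_year]

for hour, value in enumerate(per_day):
    print(f"{hour:02d}:00  {value:.2f}")

print(f"All hours combined: {sum(per_day):.1f} burglaries per day")
```

The shape of the plot is unchanged by this rescaling; only the axis values differ. The combined figure is consistent with the 61,851 burglaries counted earlier, spread over five years.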

These observations seem interesting. Let's also visualize them in a line plot.

In [57]:
burglaries_by_time.plot(rot=0, title="Burglaries by Time of the Day", figsize=(10, 8))
plt.ylabel("Average Number of Burglaries")
plt.xlabel("Time (in hours)")
plt.show()

This line plot confirms the observations from our bar plot and shows the big difference in burglary occurrence between different times of the day.

Now that we have determined when burglaries are most likely to occur, let's observe where they are most likely to occur.

We will do this by creating an interactive heat map indicating the areas of Chicago with the highest concentration of burglaries.

The burglary table contains a very large number of data points, which would make our heat map cluttered and hard to read. To improve readability, we will use a random sample of 10,000 rows.

In [58]:
# Take random sample of 10,000 rows
sample_table = burglary_table.sample(n=10000)
# Display first 5 rows
sample_table.head()
Out[58]:
id primary_type description arrest location latitude longitude date year month day time
1229673 11788747 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.996119599', 'longitude': '-87... 41.996119599 -87.709314258 2019-08-09T17:30:00.000 2019 8 4 17
436285 10651454 BURGLARY HOME INVASION False {'latitude': '41.873848656', 'longitude': '-87... 41.873848656 -87.762720823 2016-08-21T17:50:00.000 2016 8 6 17
710630 11068464 BURGLARY FORCIBLE ENTRY True {'latitude': '41.746259974', 'longitude': '-87... 41.746259974 -87.663332294 2017-08-26T21:15:00.000 2017 8 5 21
258330 10356768 BURGLARY FORCIBLE ENTRY False {'latitude': '41.944512748', 'longitude': '-87... 41.944512748 -87.650600843 2015-12-22T05:00:00.000 2015 12 1 5
472192 10707176 BURGLARY UNLAWFUL ENTRY False {'latitude': '41.735921963', 'longitude': '-87... 41.735921963 -87.712649148 2016-10-06T09:00:00.000 2016 10 3 9

To map our sample, we will be using the folium package.

In [59]:
# Create map
map_osm = folium.Map(location=[41.88, -87.63], zoom_start=11)
# Drop rows where location is missing
heat_table = sample_table[sample_table['location'].notna()]
# Get heat data from sample as [latitude, longitude] pairs (the API returns coordinates as strings)
heat_data = heat_table[['latitude', 'longitude']].astype(float).values.tolist()
# Create heat map
HeatMap(heat_data, radius=20).add_to(map_osm)
    
map_osm
Out[59]:
[Interactive heat map of the sampled burglary locations]
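
Random sampling is one way to keep the map readable. Another option, sketched here as an assumption rather than as part of the original analysis, is to round coordinates into grid cells and pass a count per cell, since folium's HeatMap also accepts points of the form [lat, lng, weight]. The function name to_weighted_cells and the second sample point are hypothetical; the other two points are taken from the burglary table shown earlier.

```python
import math
from collections import Counter

# (latitude, longitude) pairs as strings, as returned by the API
points = [('41.975141744', '-87.76454628'),
          ('41.975100000', '-87.764600000'),  # hypothetical near-duplicate
          ('41.882806623', '-87.705030858'),
          (None, None)]                        # a row without coordinates

def to_weighted_cells(rows, decimals=3):
    """Round coordinates into grid cells and count burglaries per cell."""
    cells = Counter()
    for lat, lon in rows:
        try:
            lat_f, lon_f = float(lat), float(lon)
        except (TypeError, ValueError):
            continue  # missing or malformed coordinates
        if math.isnan(lat_f) or math.isnan(lon_f):
            continue
        cells[(round(lat_f, decimals), round(lon_f, decimals))] += 1
    # [lat, lon, weight] rows, the weighted form accepted by HeatMap
    return [[lat, lon, count] for (lat, lon), count in cells.items()]

heat_data = to_weighted_cells(points)
print(heat_data)
```

With three decimal places each cell is on the order of 100 m across, so every burglary contributes to the map while the number of plotted points stays manageable.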

Now let's make our map a bit more descriptive, by adding some more data.

We will be creating circles, indicating the location of each burglary. By clicking on the circles, one will be able to see the date and time of the incident. Additionally, green circles will indicate burglaries where the offender has been arrested, while black circles will mean that the offender has not been arrested.

In [61]:
# Add circles
for index, row in heat_table.iterrows():
    # Green marks an arrest; black marks no arrest
    color = 'green' if row['arrest'] == True else 'black'

    folium.Circle(
        radius=20,
        location=[row['latitude'], row['longitude']],
        popup=row['date'],
        color=color,
        fill=True,
    ).add_to(map_osm)
    
map_osm
Out[61]:
[Interactive map with a circle for each sampled burglary: green where an arrest was made, black otherwise]

A considerably large proportion of the circles on the map above are black. This means that most burglaries in the sample did not result in an arrest. Let's visualize this in a pie chart.

In [74]:
burglary_table['arrest'].value_counts().plot(kind='pie', figsize=(8, 8), title="Burglars Arrested", autopct='%1.1f%%')
plt.ylabel("")
plt.show()

From the plot above, we can see that a surprisingly low percentage of burglaries lead to an arrest: 94.8% of reported burglaries have no arrest recorded.
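
The percentage shown in the pie chart can also be computed directly. Assuming the arrest column holds one True/False flag per reported incident, the mean of the flags is the share of incidents with an arrest. The sketch below uses a 20-row stand-in series, not the real data.

```python
import pandas as pd

# Stand-in for burglary_table['arrest']: 18 incidents without an arrest, 2 with
arrest = pd.Series([False] * 18 + [True] * 2, name='arrest')

# The mean of True/False flags is the share of True values
arrest_rate = arrest.astype(bool).mean()
no_arrest_rate = 1 - arrest_rate

print(f"Arrest recorded: {arrest_rate:.1%}")
print(f"No arrest recorded: {no_arrest_rate:.1%}")
```

Note that this measures incidents rather than people: one burglar may be responsible for several incidents, so the share of burglars who are never arrested may differ from the share of incidents without an arrest.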

Now let's plot some other interesting statistics, using our complementary datasets.

We will use pie charts to visualize the age, sex, and race distributions of burglary offenders and victims.

In [73]:
offender_age[:7].plot(kind='pie', y='Value', figsize=(8,8), autopct='%1.1f%%', title="Burglars by Age")
plt.ylabel("")
plt.show()
In [75]:
offender_sex.plot(kind='pie', y='Value', figsize=(8,8), autopct='%1.1f%%', title="Burglars by Sex")
plt.ylabel("")
plt.show()
In [76]:
offender_race.plot(kind='pie', y='Value', figsize=(8,8), autopct='%1.1f%%', title="Burglars by Race")
plt.ylabel("")
plt.show()
In [78]:
victim_age[:10].plot(kind='pie', y='Value', figsize=(8,8), autopct='%1.1f%%', title="Victims by Age")
plt.ylabel("")
plt.show()
In [79]:
victim_sex.plot(kind='pie', y='Value', figsize=(8,8), autopct='%1.1f%%', title="Victims by Sex")
plt.ylabel("")
plt.show()
In [80]:
victim_race[:5].plot(kind='pie', y='Value', figsize=(8,8), autopct='%1.1f%%', title="Victims by Race")
plt.ylabel("")
plt.show()
In [84]:
victim_offender_relationship[:12].plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Relationship between Victim and Offender", subplots=True)
plt.ylabel("")
plt.show()
In [83]:
linked_offenses[:15].plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Offenders linked to other offenses", subplots=True)
plt.ylabel("")
plt.show()

4. Insight & Observations

For the last stage of the data lifecycle, we will be utilizing the analysis we conducted to derive some insights and observations about burglaries in Chicago.

The number of burglaries has decreased every year since 2016, which suggests that Chicago is becoming a safer place to live, at least with respect to burglary.

Most burglaries occur in the summer months, on weekdays, between 8am and 12pm. Our analyses of burglaries by month, day, and time agree with each other and all support the same hypothesis: burglars prefer vacant homes, where the chance of confrontation is lower.

The highest concentration of burglaries seems to be in the center of the city. Beyond that, there does not appear to be any other obvious spatial trend. One plausible hypothesis, which this analysis does not test, is that wealthier, less-secure households have a higher chance of being burglarized.

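In the FBI data, the most common offender categories are male (about 55%), white (about 42%), and aged 20-29 (about 21%), with a large share of records marked Unknown in each case. In the Chicago data, around 95% of reported burglaries have no arrest recorded.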
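
How dominant each offender category looks depends heavily on whether Unknown records are counted. The sketch below recomputes the shares both ways from the counts in the tables shown earlier; the function name top_share is introduced here for illustration.

```python
# Counts copied from the offender tables in the Data Collection section
offender_sex = {'Male': 799186, 'Female': 185725, 'Unknown': 472674}
offender_race = {'Asian': 6421, 'Native Hawaiian': 0,
                 'Black or African American': 293696,
                 'American Indian or Alaska Native': 10945,
                 'White': 618786, 'Unknown': 526479}
offender_age = {'0-9': 4041, '10-19': 182462, '20-29': 306174, '30-39': 231868,
                '40-49': 104887, '50-59': 69272, '60-69': 10986, '70-79': 1744,
                '80-89': 496, '90-Older': 2269, 'Unknown': 546607}

def top_share(counts, include_unknown=True):
    """Return the largest known category and its share of the chosen total."""
    known = {k: v for k, v in counts.items() if k != 'Unknown'}
    total = sum(counts.values()) if include_unknown else sum(known.values())
    top = max(known, key=known.get)
    return top, known[top] / total

for name, table in [('sex', offender_sex), ('race', offender_race), ('age', offender_age)]:
    top, share_all = top_share(table)
    _, share_known = top_share(table, include_unknown=False)
    print(f"{name}: {top} is {share_all:.1%} of all records, {share_known:.1%} of known records")
```

Because the three files are separate summaries, they show the most common category for each attribute on its own; they do not show how often the three attributes occur together in the same offender.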

A considerable number of burglary victims seem to know the burglar in some way. Only about 19% of recorded victim-offender relationships are "Stranger", although about 36% are recorded as "Relationship Unknown".
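
The stranger share can be checked against the relationship table shown earlier, both with and without the "Relationship Unknown" records. The counts below are copied from that table; the variable names are introduced here for illustration.

```python
# Counts copied from the victim-offender relationship table
relationship = {
    'Acquaintance': 19045, 'Babysittee': 45, 'Boyfriend/Girlfriend': 13371,
    'Child of Boyfriend/Girlfriend': 190, 'Child': 454, 'Employee': 134,
    'Employer': 216, 'Friend': 2814, 'Grandchild': 14, 'Grandparent': 388,
    'Homosexual Relationship': 328, 'In-Law': 500, 'Neighbor': 2840,
    'Other Family Member': 2312, 'Otherwise Known': 14171, 'Parent': 2025,
    'Relationship Unknown': 51544, 'Sibling': 1020, 'Stepchild': 101,
    'Spouse': 1856, 'Stepparent': 263, 'Stepsibling': 49, 'Stranger': 26995,
    'Offender': 478, 'Ex Spouse': 2706, 'Common Law Spouse': 249,
}

total = sum(relationship.values())
unknown = relationship['Relationship Unknown']
stranger = relationship['Stranger']

share_of_all = stranger / total                # Unknown records counted
share_of_known = stranger / (total - unknown)  # Unknown records excluded

print(f"Stranger share of all records:   {share_of_all:.1%}")
print(f"Stranger share of known records: {share_of_known:.1%}")
print(f"Unknown share of all records:    {unknown / total:.1%}")
```

Excluding the unknown records raises the stranger share noticeably, so the 19% figure is best read as a lower bound on how often the offender was a stranger.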

Among burglary offenders who were also linked to another offense, about 50% of the linked offenses (within the 15 most common categories) were Destruction/Damage/Vandalism of Property.